DESCRIPTION
| Variables | Description |
|---|---|
| Second mortgage: | Households with a second mortgage statistics |
| Home equity: | Households with a home equity loan statistics |
| Debt: | Households with any type of debt statistics |
| Mortgage Costs: | Statistics regarding mortgage payments, home equity loans, utilities, and property taxes |
| Home Owner Costs: | Sum of utilities and property taxes statistics |
| Gross Rent: | Contract rent plus the estimated average monthly cost of utilities |
| High school Graduation: | High school graduation statistics |
| Population Demographics: | Population demographics statistics |
| Age Demographics: | Age demographic statistics |
| Household Income: | Total income of people residing in the household |
| Family Income: | Total income of people related to the householder |
Data Import and Preparation:
Exploratory Data Analysis (EDA):
4. Perform debt analysis. You may take the following steps:
a) Explore the top 2,500 locations where the percentage of households with a second mortgage is the highest and percent ownership is above 10 percent. Visualize using a geo-map. You may cap the percent of households with a second mortgage at 50 percent.
b) Use the following bad debt equation: Bad Debt = P(Second Mortgage ∩ Home Equity Loan), i.e. bad_debt = second_mortgage + home_equity - home_equity_second_mortgage
c) Create pie charts to show overall debt and bad debt
d) Create Box and whisker plot and analyze the distribution for 2nd mortgage, home equity, good debt, and bad debt for different cities
e) Create a collated income distribution chart for family income, household income, and remaining income
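The bad debt equation in step b) follows inclusion-exclusion logic, which can be sketched on toy numbers (the percentages below are hypothetical, chosen only for illustration):

```python
# Inclusion-exclusion sketch of the bad-debt equation (toy numbers, not
# taken from the dataset): households holding both loan types would be
# double-counted, so the overlap is subtracted once.
def bad_debt(second_mortgage, home_equity, home_equity_second_mortgage):
    return second_mortgage + home_equity - home_equity_second_mortgage

# 8% hold a second mortgage, 10% a home equity loan, 3% hold both
print(round(bad_debt(0.08, 0.10, 0.03), 2))  # 0.15
```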
Exploratory Data Analysis (EDA):
Perform EDA and come out with insights into population density and age. You may have to derive new fields (make sure to weight averages for accurate measurements):
a) Use pop and ALand variables to create a new field called population density
b) Use male_age_median, female_age_median, male_pop, and female_pop to create a new field called median age
c) Visualize the findings using appropriate chart type
Create bins for population in a new variable by selecting an appropriate class interval so that the number of categories doesn't exceed 5, for ease of analysis.
a) Analyze the married, separated, and divorced population for these population brackets
b) Visualize using appropriate chart type
Data Pre-processing:
The economic multivariate data has a significant number of measured variables. The goal is to find whether the measured variables depend on a smaller number of unobserved common factors, or latent variables. Each variable is assumed to depend on a linear combination of the common factors, and the coefficients are known as loadings. Each measured variable also includes a component due to independent random variability, known as "specific variance" because it is specific to one variable. Obtain the common factors and then plot the loadings. Use factor analysis to find latent variables in the dataset and gain insight into the linear relationships in the data. The latent variables of interest are:
• Highschool graduation rates
• Median population age
• Second mortgage statistics
• Percent own
• Bad debt expense
Data Modeling:
Build a linear regression model to predict the total monthly expenditure for home mortgage loans. Please refer to 'deplotment_RE.xlsx'. The column hc_mortgage_mean is the predicted variable; it is the mean monthly mortgage and owner costs of a specified geographical location. Note: exclude rows which have NaN (Not a Number) values for hc_mortgage_mean from the prediction model.
a) Run a model at a Nation level. If the accuracy levels and R square are not satisfactory proceed to below step.
b) Run another model at the State level. The dataset contains 52 state-level entities.
c) Keep the below considerations in mind while building the linear regression model:
• Variables should have significant impact on predicting Monthly mortgage and owner costs
• Utilize all predictor variables to start with the initial hypothesis
• R square of 60 percent and above should be achieved
• Ensure multicollinearity does not exist among the independent (predictor) variables
• Test if predicted variable is normally distributed
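The multicollinearity consideration above is commonly checked with variance inflation factors. A minimal sketch using plain NumPy (the `vif` helper and the synthetic data are hypothetical; in the project it would be applied to the predictor matrix):

```python
# Variance inflation factor: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors.
import numpy as np

def vif(X):
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1 - ss_res / ss_tot
        out.append(1.0 / (1.0 - r2) if r2 < 1 else float('inf'))
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])  # 3rd col ~ 1st
print(vif(X))  # columns 0 and 2 show very large VIFs; column 1 stays near 1
```

A VIF above roughly 5-10 is a common rule of thumb for dropping or combining a predictor.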
a) Box plot of distribution of average rent by type of place (village, urban, town, etc.).
b) Pie charts to show overall debt and bad debt.
c) Explore the top 2,500 locations where the percentage of households with a second mortgage is the highest and percent ownership is above 10 percent. Visualize using geo-map.
d) Heat map for correlation matrix.
e) Pie chart to show the population distribution across different types of places (village, urban, town etc.)
Import libraries
import time
import random
from math import *
import operator
import pandas as pd
import numpy as np
# import plotting libraries
import matplotlib
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
%matplotlib inline
import seaborn as sns
sns.set(style="white", color_codes=True)
sns.set(font_scale=1.5)
Import the real estate training and testing datasets
df_train=pd.read_csv("train_real estate.csv")
df_test=pd.read_csv("test_real estate.csv")
# check the column labels of the train dataframe
df_train.columns
# check the column labels of the test dataframe
df_test.columns
# view the top 3 rows of the training data
df_train.head(3)
# view the top 3 rows of the testing data
df_test.head(3)
# view statistical details of the training data
df_train.describe()
# view statistical details of the training data
df_test.describe()
# check the data types in the train dataframe
type_df = df_train.dtypes.reset_index()
type_df.columns = ['count', 'dtypes']
type_df.groupby('dtypes').agg('count').reset_index()
# check the data types in the test dataframe
type_df = df_test.dtypes.reset_index()
type_df.columns = ['count', 'dtypes']
type_df.groupby('dtypes').agg('count').reset_index()
The 'UID' column contains unique location identification values so an index can be created from the UID feature.
# use UID as the index for training dataframe
df_train.set_index(keys=['UID'],inplace=True)
# use UID as index for testing dataframe
df_test.set_index(keys=['UID'],inplace=True)
# check if UID is now the index
df_train.head(3)
# check if UID is now the index
df_test.head(3)
# checking for missing values in the training dataframe
def missing_zero_values_table(df):
    zero_val = (df == 0.00).astype(int).sum(axis=0)
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
    mz_table = mz_table.rename(
        columns={0: 'Zero Values', 1: 'Missing Values', 2: '% of Total Values'})
    mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
    mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
    mz_table['Data Type'] = df.dtypes
    mz_table = mz_table[
        mz_table.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("The selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " rows.\n"
          "There are " + str(mz_table.shape[0]) + " columns that have missing values.")
    return mz_table
missing_zero_values_table(df_train)
# checking for missing values in the testing dataframe (reusing the same function)
missing_zero_values_table(df_test)
The BLOCKID column will be dropped from the train and test datasets since it adds no value: it is 100% missing.
# identifying features with 0 variance in the train dataframe
df_num = df_train.loc[:,df_train.dtypes==np.int64]
temp = []
for i in df_num.columns:
    if df_train[i].var() == 0:
        temp.append(i)
print(len(temp))
print(temp)
# identifying features with 0 variance in the test dataframe
df_num1 = df_test.loc[:,df_test.dtypes==np.int64]
temp1 = []
for i in df_num1.columns:
    if df_test[i].var() == 0:
        temp1.append(i)
print(len(temp1))
print(temp1)
The SUMLEVEL column has 0 variance and will be dropped from the train and test datasets.
# dropping the BLOCKID and SUMLEVEL columns from the train dataframe
df_train.drop(columns=['BLOCKID','SUMLEVEL'],inplace=True)
# dropping the BLOCKID and SUMLEVEL columns from the test dataframe
df_test.drop(columns=['BLOCKID','SUMLEVEL'],inplace=True)
Missing values will be filled with the column mean for both the train and test dataframes
# identify columns with missing values in the train dataframe
missing_train_cols=[]
for col in df_train.columns:
    if df_train[col].isna().sum() != 0:
        missing_train_cols.append(col)
print(missing_train_cols)
# identify columns with missing values in the test dataframe
missing_test_cols=[]
for col in df_test.columns:
    if df_test[col].isna().sum() != 0:
        missing_test_cols.append(col)
print(missing_test_cols)
# the columns with missing values are all numerical, so fill them with the column mean
for col in missing_train_cols:
    df_train[col] = df_train[col].fillna(df_train[col].mean())
# the columns with missing values are all numerical, so fill them with the column mean
for col in missing_test_cols:
    df_test[col] = df_test[col].fillna(df_test[col].mean())
df_train.isna().sum().sum()
df_test.isna().sum().sum()
#!pip install pandasql
from pandasql import sqldf
q1 = "select place,pct_own,second_mortgage,lat,lng from df_train where pct_own>0.10 and second_mortgage<0.5 order by second_mortgage DESC LIMIT 2500;"
pysqldf = lambda q: sqldf(q, globals())
df_train_sec_mort_pct_own=pysqldf(q1)
df_train_sec_mort_pct_own.head()
# export the created dataframe for tableau part of the project
# df_train_sec_mort_pct_own.to_csv('top 2500 locations')
#!pip install plotly
import plotly.express as px
import plotly.graph_objects as go
fig = go.Figure(data=go.Scattergeo(
lat = df_train_sec_mort_pct_own['lat'],
lon = df_train_sec_mort_pct_own['lng']),
)
fig.update_layout(
geo=dict(
scope = 'north america',
showland = True,
landcolor = "rgb(212, 212, 212)",
subunitcolor = "rgb(255, 255, 255)",
countrycolor = "rgb(255, 255, 255)",
showlakes = True,
lakecolor = "rgb(255, 255, 255)",
showsubunits = True,
showcountries = True,
resolution = 50,
projection = dict(
type = 'conic conformal',
rotation_lon = -100
),
lonaxis = dict(
showgrid = True,
gridwidth = 0.5,
range= [ -140.0, -55.0 ],
dtick = 5
),
lataxis = dict (
showgrid = True,
gridwidth = 0.5,
range= [ 20.0, 60.0 ],
dtick = 5
)
),
title='Top 2,500 locations with the highest second mortgage and percent ownership above 10%')
fig.show()
second_mortgage_list=df_train['second_mortgage'].values.tolist()
home_equity_list = (df_train['home_equity']-df_train['home_equity_second_mortgage']).values.tolist()
from matplotlib_venn import venn2
def venn_diagram(a, b, labels=['Second mortgage', 'Home equity loan']):
    a = list(set(a))
    b = list(set(b))
    only_a = len([x for x in a if x not in b])
    only_b = len([x for x in b if x not in a])
    a_b = len(np.intersect1d(a, b))
    venn2(subsets=(only_a, only_b, a_b), set_labels=labels)
venn_diagram(second_mortgage_list, home_equity_list)
df_train['bad_debt'] = df_train['second_mortgage'] + df_train['home_equity'] - df_train['home_equity_second_mortgage']
bad_debt_list = df_train['bad_debt'].values.tolist()
good_debt_list = (df_train['debt']-df_train['bad_debt']).values.tolist()
debt_list = df_train['debt'].values.tolist()
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(12,10))
labels = ['Bad debt', 'Good debt']
bad_debt_pct = (df_train['bad_debt'].sum()/df_train['debt'].sum()*100)
values = [bad_debt_pct, (100-bad_debt_pct)]
explode = [0.05, 0]
colors = ['peachpuff', 'b']
ax1.pie(values, labels=labels, autopct='%.1f%%', explode=explode, colors=colors, startangle=0,
wedgeprops={"edgecolor":"0",'linewidth': 1,
'linestyle': 'dashed', 'antialiased': True})
ax1.set_title('Bad vs Good Debt')
labels = ['second mortgage', 'Home equity loan']
bad_debt_pct = (df_train['second_mortgage'].sum()/df_train['bad_debt'].sum()*100)
values = [bad_debt_pct, (100-bad_debt_pct)]
explode = [0.05, 0]
colors = ['peachpuff', 'sandybrown']
ax2.pie(values, labels=labels, autopct='%.1f%%', explode=explode, colors=colors, startangle=0,
wedgeprops={"edgecolor":"0",'linewidth': 1,
'linestyle': 'dashed', 'antialiased': True})
ax2.set_title('Second mortgage vs Home equity loan')
plt.show()
df_box_hamilton=df_train.loc[df_train['city'] == 'Hamilton']
df_box_manhattan=df_train.loc[df_train['city'] == 'Manhattan']
df_box_LA=df_train.loc[df_train['city'] == 'Los Angeles']
df_box_Brooklyn=df_train.loc[df_train['city'] == 'Brooklyn']
df_box_Philadelphia=df_train.loc[df_train['city'] == 'Philadelphia']
df_box_Columbia=df_train.loc[df_train['city'] == 'Columbia']
df_box_Tucson=df_train.loc[df_train['city'] == 'Tucson']
df_box_Columbus=df_train.loc[df_train['city'] == 'Columbus']
df_box_Hollywood=df_train.loc[df_train['city'] == 'Hollywood']
df_box_city=pd.concat([df_box_hamilton, df_box_manhattan, df_box_LA, df_box_Brooklyn, df_box_Philadelphia, df_box_Columbia, df_box_Tucson, df_box_Columbus, df_box_Hollywood])
column_names = ['second_mortgage','home_equity','debt','bad_debt']
for column in column_names:
    plt.figure(figsize=(15,5))
    sns.boxplot(x='city', y=column, width=0.5, data=df_box_city)
    plt.title(column + ' by city', fontsize=20)
    plt.xlabel('', fontsize=15)
    plt.ylabel(column, fontsize=15)
    plt.show()
The data is skewed, so median values will be used for the distribution charts instead of means.
To calculate the remaining income, we subtract the median monthly mortgage and owner costs (hc_mortgage_median) from the median family income.
plt.figure(figsize=(14,5))
sns.distplot(df_train['family_median'], label='Family income')
sns.distplot(df_train['hi_median'], label='Household income')
sns.distplot(df_train['family_median']-df_train['hc_mortgage_median'], label='remaining income')
plt.title('Income distribution chart for family, household, & remaining income')
plt.xlabel('income')
plt.legend()
plt.show()
Visualize the findings using appropriate chart type.
#plt.figure(figsize=(25,10))
fig,(ax1,ax2,ax3)=plt.subplots(3,1, figsize=(9,15))
sns.distplot(df_train['pop'],ax=ax1)
sns.distplot(df_train['male_pop'],ax=ax2)
sns.distplot(df_train['female_pop'],ax=ax3)
plt.subplots_adjust(wspace=0.8,hspace=0.8)
plt.tight_layout()
plt.show()
#plt.figure(figsize=(25,10))
fig,(ax1,ax2)=plt.subplots(2,1, figsize=(9,10))
sns.distplot(df_train['male_age_mean'],ax=ax1)
sns.distplot(df_train['female_age_mean'],ax=ax2)
plt.subplots_adjust(wspace=0.8,hspace=0.8)
plt.tight_layout()
plt.show()
Calculating the population density: the number of people per unit of land area (excluding areas of water)
df_train['population_density'] = df_train['pop']/df_train['ALand']
df_test['population_density'] = df_test['pop']/df_test['ALand']
plt.figure(figsize=(9,5))
sns.distplot(df_train['population_density'])
plt.title('Population Density')
plt.show()
df_train['age_median']=(df_train['male_age_median']+df_train['female_age_median'])/2
df_test['age_median']=(df_test['male_age_median']+df_test['female_age_median'])/2
df_train[['male_age_median','female_age_median','male_pop','female_pop','age_median']].head()
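Note that the simple average of male_age_median and female_age_median above ignores the male/female population weights suggested in the brief. A population-weighted version would look like this (a sketch; `weighted_median_age` is a hypothetical helper, not applied to the dataset here):

```python
# Weight each gender's median age by its population, so locations with a
# gender imbalance are measured more accurately (hypothetical helper).
def weighted_median_age(male_age_median, female_age_median, male_pop, female_pop):
    return (male_age_median * male_pop + female_age_median * female_pop) / (male_pop + female_pop)

# with equal populations this reduces to the simple average
print(weighted_median_age(30.0, 40.0, 100, 100))  # 35.0
# with 3x as many males, the result shifts toward the male median
print(weighted_median_age(30.0, 40.0, 300, 100))  # 32.5
```

On the dataframe this would be `(df_train['male_age_median']*df_train['male_pop'] + df_train['female_age_median']*df_train['female_pop']) / (df_train['male_pop'] + df_train['female_pop'])`, since pandas applies the arithmetic element-wise.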
# plotting the median age
plt.figure(figsize=(9,5))
sns.distplot(df_train['age_median'])
plt.title('Median Age of the population')
plt.show()
The age of the population is mostly between 20 to 60 years with peak age around 39 years old. The median age distribution is slightly positively skewed.
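The positive-skew observation can also be verified numerically. A sketch with `scipy.stats.skew` on a synthetic right-skewed sample (the real call would pass `df_train['age_median']`):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.gamma(shape=20.0, scale=2.0, size=5000)  # right-skewed stand-in data
print(round(float(stats.skew(sample)), 2))  # a positive value => right (positive) skew
```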
Plotting the median age in a boxplot
plt.figure(figsize=(10,5))
sns.boxplot(df_train['age_median'], width=0.5, color='lightseagreen' )
plt.title("Box plot of the population's median age")
plt.show()
df_train['pop'].describe()
df_train['pop_bins']=pd.cut(df_train['pop'], bins=5, labels=['very low','low','medium','high','very high'])
df_train[['pop','pop_bins']]
df_train['pop_bins'].value_counts()
df_train.groupby(by='pop_bins')[['married','separated','divorced']].count()
df_train.groupby(by='pop_bins')[['married','separated','divorced']].agg(["mean", "median"])
The very high population group has the most married people, followed by the medium group, while the very low and low population groups have more divorced and separated people than the other groups.
pop_bin_agg=df_train.groupby(by='pop_bins')[['married','separated','divorced']].agg(["mean"])
x_axis=['very low', 'low', 'medium', 'high', 'very high']
plt.figure(figsize=(17,5))
bp=plt.plot(x_axis, pop_bin_agg, marker='*', lw=2.2)
plt.legend(bp[:3],['Married','Separated','Divorced'],loc='best')
plt.title('Average of married, separated, and divorced populace')
plt.xlabel('population group')
plt.ylabel('Average')
plt.grid(linewidth=0.7)
plt.show()
rent_state_mean=df_train.groupby(by='state')['rent_mean'].agg(["mean"])
rent_state_mean.head()
income_state_mean=df_train.groupby(by='state')['family_mean'].agg(["mean"])
income_state_mean.head()
rent_pct_income=rent_state_mean['mean']/income_state_mean['mean']
rent_pct_income.head(10)
Calculating the rent as a percentage of family income:
df_train['rent_mean'].sum()/df_train['family_mean'].sum()*100
cor=df_train[['COUNTYID','STATEID','zip_code','type','pop', 'family_mean',
'second_mortgage', 'home_equity', 'debt','hs_degree',
'age_median','pct_own', 'married','separated', 'divorced']].corr()
plt.figure(figsize=(20,10))
sns.heatmap(cor, annot=True,cmap='viridis')
plt.show()
There is a significant positive correlation between high school degree and family income. There is a significant positive correlation between percentage ownership and married. There is a moderate correlation between median age and percentage ownership.
# confirming the relationship with scatter plot
plt.figure(figsize=(9,5))
plt.scatter(df_train['pct_own'], df_train['married'], edgecolor='black')
plt.xlabel('percentage ownership')
plt.ylabel('married')
plt.show()
1. The economic multivariate data has a significant number of measured variables. The goal is to find whether the measured variables depend on a smaller number of unobserved common factors, or latent variables.
2. Each variable is assumed to depend on a linear combination of the common factors, and the coefficients are known as loadings. Each measured variable also includes a component due to independent random variability, known as "specific variance" because it is specific to one variable. Obtain the common factors and then plot the loadings.
Use factor analysis to find latent variables in the dataset and gain insight into the linear relationships in the data.
#!pip install factor_analyzer
from sklearn.decomposition import FactorAnalysis
from factor_analyzer import FactorAnalyzer
To perform factor analysis, we first need to evaluate the "factorability" of the dataset.
Performing an adequacy test (Bartlett's Test of Sphericity):
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
df = df_train.select_dtypes(exclude= ('object','category'))
chi_square_value, p_value = calculate_bartlett_sphericity(df)
chi_square_value, p_value
In Bartlett's test, the p-value is 0. The test is statistically significant, indicating that the observed correlation matrix is not an identity matrix and the dataset is suitable for factor analysis.
# create factor analyzer object
fa=FactorAnalyzer(n_factors=5, rotation='varimax')
# fit the dataset
fa.fit(df_train.select_dtypes(exclude= ('object','category')))
# get the loadings
loadings = fa.loadings_
# get the eigenvectors and eigenvalues
eigen_values, vectors = fa.get_eigenvalues()
# do a scree plot to check if 5 factors are sufficient
# determine the x axis
xvals = range(1, df_train.select_dtypes(exclude= ('object','category')).shape[1]+1)
# scree plot
#plot a line plot to visualize how the eigenvalues change as added components increase
plt.figure(figsize=(10,5))
plt.scatter(xvals, eigen_values)
plt.plot(xvals, eigen_values, linewidth=2.5)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid(True)
plt.show()
From the scree plot, roughly 9 eigenvalues are greater than one, so our initial guess of n_factors=5 was too low. We'll reinitialize the FactorAnalyzer object with n_factors=9 before continuing to the next step.
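The "eigenvalues greater than one" rule (the Kaiser criterion) can be made explicit in code. A sketch on illustrative eigenvalues (the array below is made up; the real call would use `eigen_values` from `fa.get_eigenvalues()`):

```python
import numpy as np

# illustrative eigenvalues, not the dataset's actual ones
eigvals = np.array([6.2, 3.1, 2.4, 1.9, 1.6, 1.4, 1.2, 1.1, 1.05, 0.9, 0.4])
n_factors = int((eigvals > 1).sum())
print(n_factors)  # 9
```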
# create factor analyzer object with 9 factors
fa1=FactorAnalyzer(n_factors=9, rotation='varimax')
fa1.fit(df_train.select_dtypes(exclude= ('object','category')))
loadings = fa1.loadings_
pd.DataFrame.from_records(loadings)
The loading values show the weight or importance of each factor for each of the features being considered. Specific factors have high loading values for specific variables.
Displaying the variable names against corresponding loading values:
df_factors = pd.DataFrame.from_records(loadings)
var_list = df_train.select_dtypes(exclude= ('object','category')).columns
df_factors.index = var_list
df_factors.head()
# to display all rows i.e. variables
pd.set_option('display.max_rows', 100)
df_factors.head(100)
df_factors.loc[df_factors[0] >=0.5]
df_factors.loc[df_factors[1] >=0.5]
df_factors.loc[df_factors[2] >=0.5]
df_factors.loc[df_factors[3] >=0.5]
df_factors.loc[df_factors[4] >=0.5]
df_factors.loc[df_factors[5] >=0.5]
df_factors.loc[df_factors[6] >=0.5]
df_factors.loc[df_factors[7] >=0.5]
df_factors.loc[df_factors[8] >=0.5]
# get variance of each of the factors
fa1.get_factor_variance()
arr = fa1.get_factor_variance()
df_variance = pd.DataFrame(arr, columns =['Factor 0', 'Factor 1', 'Factor 2', 'Factor 3', 'Factor 4','Factor 5','Factor 6','Factor 7','Factor 8'])
df_variance.index = ['SS Loadings','Proportion Var','Cumulative Var']
df_variance
Approximately 71% of the cumulative variance is explained by the 9 factors.
Please refer to 'deplotment_RE.xlsx'.
The column hc_mortgage_mean is the predicted variable; it is the mean monthly mortgage and owner costs of a specified geographical location.
Note: exclude rows which have NaN (not a number, i.e. missing) values for hc_mortgage_mean from the prediction model.
df_train.columns
# converting the geo loc type to numeric
df_train['type'].unique()
type_dict={'type':{'City':1,
'Urban':2,
'Town':3,
'CDP':4,
'Village':5,
'Borough':6}
}
df_train.replace(type_dict,inplace=True)
df_train['type'].unique()
df_test.replace(type_dict,inplace=True)
df_test['type'].unique()
feature_cols=['COUNTYID','STATEID','zip_code','type','pop', 'family_mean',
'second_mortgage', 'home_equity', 'debt','hs_degree',
'age_median','pct_own', 'married','separated', 'divorced']
x_train=df_train[feature_cols]
y_train=df_train['hc_mortgage_mean']
x_test=df_test[feature_cols]
y_test=df_test['hc_mortgage_mean']
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
x_train.head()
sc=StandardScaler()
x_train_scaled=sc.fit_transform(x_train)
x_test_scaled=sc.transform(x_test)  # transform (not fit_transform) so the test data uses the scaler fitted on the training data
linereg=LinearRegression()
linereg.fit(x_train_scaled,y_train)
y_pred=linereg.predict(x_test_scaled)
print("R2 score of the linear regression model:", r2_score(y_test,y_pred))
print("RMSE of the linear regression model:", np.sqrt(mean_squared_error(y_test,y_pred)))
An R2 score of about 73%, above the 60% benchmark, was achieved at the nation level, but let's compare the model results at the state level.
state=df_train['STATEID'].unique()
# selecting a few IDs
state[0:5]
for i in [20, 1, 45]:
    print("State ID -", i)
    # filter the train/test data to the current state (STATEID, not COUNTYID)
    x_train_state = df_train[df_train['STATEID'] == i][feature_cols]
    y_train_state = df_train[df_train['STATEID'] == i]['hc_mortgage_mean']
    x_test_state = df_test[df_test['STATEID'] == i][feature_cols]
    y_test_state = df_test[df_test['STATEID'] == i]['hc_mortgage_mean']
    x_train_scaled_state = sc.fit_transform(x_train_state)
    x_test_scaled_state = sc.transform(x_test_state)
    linereg.fit(x_train_scaled_state, y_train_state)
    y_pred_state = linereg.predict(x_test_scaled_state)
    print("R2 score of the linear regression model for state", i, "is", r2_score(y_test_state, y_pred_state))
    print("RMSE of the linear regression model for state", i, "is", np.sqrt(mean_squared_error(y_test_state, y_pred_state)))
    print("\n")
# check the residuals
residuals=y_test-y_pred
residuals
# plotting the distribution of residuals:
plt.figure(figsize=(9,5))
plt.hist(residuals)
# Examine the normal plot of the residuals
plt.figure(figsize=(9,5))
sns.distplot(residuals)
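The brief also asks to test whether the predicted variable (in practice, the residuals) is normally distributed. Beyond the visual check above, a D'Agostino-Pearson test can be applied; a sketch on synthetic data (the real call would be `stats.normaltest(residuals)`):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_like = rng.normal(loc=0.0, scale=250.0, size=2000)  # stand-in residuals
uniform_like = rng.uniform(-500, 500, size=2000)           # clearly non-normal

_, p_normal = stats.normaltest(normal_like)
_, p_uniform = stats.normaltest(uniform_like)
print(p_normal, p_uniform)  # a small p-value (< 0.05) rejects normality
```

If the residuals fail the test, transforming the target (e.g. with a log) is a common next step.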
# residual plot - check independence (no pattern should be visible)
plt.figure(figsize=(9,5))
plt.scatter(y_pred, residuals, edgecolor='black')
plt.xlabel('predicted values')
plt.ylabel('residuals')
plt.show()
Link to Tableau part of the project: https://public.tableau.com/views/CaptsoneProject1_RealEstateAnalysis/Realestateanalysis?:language=en&:display_count=y&:origin=viz_share_link